# Visual Question Answering
## SpaceOm GGUF

mgonzs13 · Apache-2.0 · Image-to-Text · English · 196 downloads · 1 like

SpaceOm-GGUF is a multimodal visual question answering model that is particularly strong at spatial reasoning. A loading sketch for GGUF checkpoints follows.

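GGUF checkpoints such as this one target llama.cpp-compatible runtimes rather than plain PyTorch. Below is a minimal loading sketch using llama-cpp-python; the quantization filename pattern is an assumption (check the repo's file list), and true image input additionally requires the model's mmproj projector file with a multimodal chat handler, which is omitted here.

```python
# Minimal sketch: pull a GGUF file from the Hub and run a text-only prompt.
# pip install llama-cpp-python huggingface-hub
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="mgonzs13/SpaceOm-GGUF",   # repo from the entry above
    filename="*Q4_K_M.gguf",           # assumed quant; verify on the model page
    n_ctx=4096,                        # context window to allocate
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Which object is to the left of the chair?"}]
)
print(out["choices"][0]["message"]["content"])
```
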
## Gemma 3 12B It QAT Int4 GGUF

unsloth · Image-to-Text · 1,921 downloads · 3 likes

Gemma 3 is Google's lightweight open model family built on Gemini technology. This 12B instruction-tuned variant is quantized to INT4 using Quantization-Aware Training (QAT), supports multimodal input, and provides a 128K context window.

## My Model

anoushhka · MIT · Image-to-Text · PyTorch · Multilingual · 87 downloads · 0 likes

GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.

## VoRA 7B Instruct

Hon-Wong · Image-to-Text · Transformers · 154 downloads · 12 likes

VoRA is a 7B-parameter vision-language model focused on image-text-to-text tasks.

## Sapnous VR 6B

Sapnous-AI · Apache-2.0 · Image-to-Text · Transformers · English · 261 downloads · 5 likes

Sapnous-6B is a vision-language model whose multimodal capabilities target strong visual perception and understanding.

## Gemma 3 12B It GGUF

ggml-org · Image-to-Text · 8,110 downloads · 23 likes

Gemma 3 is Google's lightweight open multimodal model family, built on the same technology as Gemini; it accepts text and image inputs and generates text outputs.

## Gemma 3 27B It

google · Image-to-Text · Transformers · 371.46k downloads · 1,274 likes

Gemma is Google's lightweight, state-of-the-art open model family, built on the same technology as Gemini, supporting multimodal input and text output.

## SmolVLM2 500M Video Instruct

HuggingFaceTB · Apache-2.0 · Image-to-Text · Transformers · English · 17.89k downloads · 56 likes

A lightweight multimodal model designed for analyzing video content; it processes video, image, and text inputs and generates text outputs.

## SmolVLM2 256M Video Instruct

HuggingFaceTB · Apache-2.0 · Image-to-Text · Transformers · English · 22.16k downloads · 53 likes

SmolVLM2-256M-Video is a lightweight multimodal model built specifically for analyzing video content; it processes video, image, and text inputs and generates text outputs.

## Qwen2.5 VL 7B Instruct Quantized.w8a8

RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English · 1,992 downloads · 3 likes

A quantized version of Qwen2.5-VL-7B-Instruct that takes vision-text input and produces text output; inference efficiency is improved through INT8 quantization of both weights and activations (w8a8).

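Here, "w8a8" means weights and activations are both stored as 8-bit integers. The snippet below is a generic illustration of symmetric per-tensor INT8 quantization, not the exact calibration recipe used for this checkpoint: each tensor is scaled so its largest magnitude maps to 127, rounded, and rescaled at compute time.

```python
# Generic symmetric per-tensor INT8 round-trip; illustrative only.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                      # largest magnitude -> 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                  # approximate reconstruction

w = np.random.randn(4, 1024).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()
print(f"max abs error: {err:.6f} (bounded by ~scale/2 = {scale / 2:.6f})")
```
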
## Qwen2.5 VL 7B Instruct FP8 Dynamic

RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English · 25.18k downloads · 1 like

An FP8-quantized version of Qwen2.5-VL-7B-Instruct (with dynamic activation quantization), intended for efficient vision-text inference through vLLM.

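Since the card recommends vLLM, here is a minimal offline-inference smoke test. The repo id is inferred from the entry title and should be verified; image inputs go through vLLM's multimodal message format, which is omitted to keep the sketch short.

```python
# Minimal vLLM offline inference (text-only smoke test).
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic")  # id inferred from the title
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["In one sentence, what does FP8 quantization trade off?"], params)
print(outputs[0].outputs[0].text)
```
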
## Qwen2.5 VL 3B Instruct FP8 Dynamic

RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English · 112 downloads · 1 like

An FP8-quantized version of Qwen2.5-VL-3B-Instruct that takes visual-text input, produces text output, and is optimized for inference efficiency.

## LlamaV-o1

omkarthawakar · Apache-2.0 · Image-to-Text · Safetensors · English · 1,406 downloads · 93 likes

LlamaV-o1 is a multimodal large language model designed for complex visual reasoning, trained with curriculum learning and performing strongly across diverse benchmarks.

## Microsoft Git Base

seckmaster · MIT · Image-to-Text · Multilingual · 18 downloads · 0 likes

GIT is a Transformer-based generative image-to-text model capable of converting visual content into textual descriptions.

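GIT checkpoints load through the standard transformers image-to-text pipeline. A minimal captioning sketch using the upstream microsoft/git-base checkpoint (the image path is a placeholder):

```python
# Caption a local image with GIT via the transformers pipeline.
# pip install transformers pillow torch
from transformers import pipeline

captioner = pipeline("image-to-text", model="microsoft/git-base")

result = captioner("photo.jpg")          # placeholder path; URLs also work
print(result[0]["generated_text"])       # prints a short caption string
```
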
## PaliGemma2 3B Pt 896

google · Image-to-Text · Transformers · 2,536 downloads · 22 likes

PaliGemma 2 is a multimodal vision-language model that combines image and text inputs to generate text outputs; it supports multiple languages and suits a variety of vision-language tasks.

## Dermatech Qwen2 VL 2B

Rewatiramans · Image-to-Text · Transformers · 60 downloads · 3 likes

A dermatology-specific diagnostic model, LoRA-fine-tuned from Qwen2-VL-2B-Instruct, that analyzes skin-condition images and provides professional diagnostic descriptions. A generic LoRA sketch follows.

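LoRA fine-tuning freezes the base model and trains small low-rank adapter matrices injected into selected layers, which is what makes a 2B-scale medical fine-tune cheap. The peft sketch below shows the mechanics on a small text model; the rank, alpha, and target modules are illustrative, not the settings used for this dermatology model.

```python
# Generic LoRA attachment with peft; hyperparameters are illustrative.
# pip install transformers peft torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in backbone

config = LoraConfig(
    r=16,                        # adapter rank
    lora_alpha=32,               # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused attention projection
)

model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the adapter weights are trainable
```
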
## Florence 2 FT Lung Cancer Detection

nirusanan · Image-to-Text · Transformers · English · 20 downloads · 1 like

A lung cancer detection model fine-tuned from Florence-2-base-ft that identifies lung cancer types from lung images.

## Peacock

UBC-NLP · Other · Image-to-Text · Arabic · 73 downloads · 1 like

Peacock is an Arabic multimodal large language model based on the InstructBLIP architecture, with AraLLaMA as its language model.

## Qwen VL Guidance

RhapsodyAI · Apache-2.0 · Image-to-Text · Transformers · 46 downloads · 2 likes

GUIChat is a visual question answering (VQA) multimodal model that understands image content and answers related questions, optimized specifically for GUI element recognition and interaction.

## Horus OCR

TeeA · Image-to-Text · Transformers · 21 downloads · 0 likes

Donut is a Transformer-based image-to-text model capable of extracting and generating textual content from images.

## PaliGemma 3B Chat V0.2

BUAADreamer · Image-to-Text · Transformers · Multilingual · 80 downloads · 9 likes

A multimodal dialogue model fine-tuned from google/paligemma-3b-mix-448 and optimized for multi-turn conversation.

## PaliGemma Vqav2

merve · Image-to-Text · Transformers · 168 downloads · 13 likes

A fine-tuned version of google/paligemma-3b-pt-224 trained on a subset of the VQAv2 dataset, specializing in visual question answering.

## 360VL 8B

qihoo360 · Apache-2.0 · Image-to-Text · Transformers · Multilingual · 22 downloads · 13 likes

360VL is a multimodal model built on the Llama 3 language model, featuring strong image understanding and bilingual dialogue capabilities.

## LLaVA Llama 3 8B

Intel · Other · Image-to-Text · Transformers · 387 downloads · 14 likes

A large multimodal model trained with the LLaVA-v1.5 framework, using the 8B-parameter Meta-Llama-3-8B-Instruct as its language backbone together with a CLIP-based vision encoder.

## LLaVA NeXT Video 7B DPO

lmms-lab · Video-to-Text · Transformers · 8,049 downloads · 27 likes

LLaVA-NeXT-Video is an open-source multimodal dialogue model, created by fine-tuning a large language model on multimodal instruction-following data; it supports mixed video and text interaction.

## UForm-Gen2-dpo

unum-cloud · Apache-2.0 · Image-to-Text · Transformers · English · 3,568 downloads · 44 likes

UForm-Gen2-dpo is a small generative vision-language model aligned for image captioning and visual question answering through Direct Preference Optimization (DPO) on the VLFeedback and LLaVA-Human-Preference-10K preference datasets. The DPO objective is sketched below.

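For reference, Direct Preference Optimization trains directly on preference pairs without a separate reward model. For a prompt $x$ with preferred response $y_w$ and rejected response $y_l$, the standard objective from the DPO paper (not spelled out on this model card) is

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference.
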
## MoAI 7B

BK-Lee · MIT · Image-to-Text · Transformers · 183 downloads · 45 likes

MoAI is a large-scale language-and-vision hybrid model capable of processing both image and text inputs to generate text outputs.

## LLaVA Maid 7B DPO GGUF

megaaziib · Image-to-Text · 99 downloads · 4 likes

LLaVA is a large language-and-vision assistant model capable of handling multimodal tasks involving images and text.

## Candle LLaVA V1.6 Mistral 7B

DanielClough · Apache-2.0 · Image-to-Text · 73 downloads · 0 likes

LLaVA is a vision-language model capable of understanding and generating text related to images.

## LLaVA V1.5 13B DPO GGUF

antiven0m · Image-to-Text · 30 downloads · 0 likes

LLaVA-v1.5-13B-DPO is a vision-language model based on the LLaVA framework, trained with Direct Preference Optimization (DPO) and converted to the GGUF quantized format for more efficient inference.

## LLaVA V1.6 34B GGUF

cjpais · Apache-2.0 · Image-to-Text · 1,965 downloads · 40 likes

LLaVA 1.6 34B is an open-source multimodal chatbot built by fine-tuning a large language model on multimodal instruction-following data; it supports image-to-text and text-to-text generation.

## LLaVA V1.6 Vicuna 13B

liuhaotian · Image-to-Text · Transformers · 7,080 downloads · 56 likes

LLaVA is an open-source multimodal chatbot, fine-tuned from large language models on multimodal instruction-following data.

## LLaVA V1.6 Mistral 7B

liuhaotian · Apache-2.0 · Image-to-Text · Transformers · 27.45k downloads · 236 likes

LLaVA is an open-source multimodal chatbot, trained by fine-tuning a large language model on multimodal instruction-following data. A transformers inference sketch follows.

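The liuhaotian repos ship weights in the original LLaVA format; plain transformers inference typically goes through an HF-converted checkpoint, assumed here to be llava-hf/llava-v1.6-mistral-7b-hf. A minimal VQA sketch under that assumption:

```python
# Minimal LLaVA-NeXT (v1.6) VQA with transformers; assumes the
# HF-converted checkpoint llava-hf/llava-v1.6-mistral-7b-hf.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # placeholder path
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"  # Mistral-style template

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```
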
## MiniCPM-V

openbmb · Image-to-Text · Transformers · 19.74k downloads · 173 likes

MiniCPM-V is an efficient, lightweight multimodal model optimized for edge-device deployment; it supports bilingual (Chinese-English) interaction and outperforms models of similar scale.

## Moondream1

vikhyatk · Image-to-Text · Transformers · English · 70.48k downloads · 487 likes

A 1.6B-parameter multimodal model that combines a SigLIP vision encoder with the Phi-1.5 language model, supporting image understanding and question-answering tasks.

## Med BLIP 2 QLoRA

NouRed · Image-to-Text · 16 downloads · 1 like

BLIP-2 is a vision-language model based on OPT-2.7B, focused on visual question answering; it understands image content and answers related questions.

## InfiMM Zephyr

Infi-MM · Image-to-Text · Transformers · English · 23 downloads · 10 likes

InfiMM is a multimodal vision-language model inspired by the Flamingo architecture, integrating recent LLMs and suited to a wide range of vision-language tasks.

## UForm-Gen-Chat

unum-cloud · Apache-2.0 · Image-to-Text · Transformers · English · 65 downloads · 19 likes

UForm-Gen-Chat is the multimodal chat fine-tune of UForm-Gen, used primarily for image caption generation and visual question answering.

## UForm-Gen

unum-cloud · Apache-2.0 · Image-to-Text · Transformers · English · 152 downloads · 44 likes

UForm-Gen is a small generative vision-language model used primarily for image caption generation and visual question answering.

## Yi VL 34B

01-ai · Apache-2.0 · Image-to-Text · 150 downloads · 263 likes

Yi-VL-34B is the open-source multimodal model of the Yi series; it understands image content, supports multi-turn conversation, and performs strongly on the MMMU and CMMMU benchmarks.